30 research outputs found

    Corpus-based automatic detection of example sentences for dictionaries for Estonian learners

    The electronic version of the dissertation does not include the publications. The function of an example sentence in a dictionary is to help the reader understand the meaning of the headword and to illustrate its contexts of use. Nowadays, the main source of example sentences is a large text corpus, where suitable sentences are very hard to find by hand. E-lexicography has produced automatic tools that help detect various kinds of information for dictionaries, including example sentences. The dissertation examines which parameters characterise the example sentences in the Dictionary of Estonian (2019), the Basic Estonian Dictionary (2014) and the Estonian Collocations Dictionary (2019), and the sentences of the Estonian Coursebook Corpus (2018); all four were compiled at the Institute of the Estonian Language. The aim of the study is to develop a method that uses these parameters to automatically identify corpus sentences suitable for learners of Estonian. To that end, a rule-based approach was applied, using Good Dictionary Examples (GDEX), a tool integrated in the Sketch Engine corpus query system, as the example. Machine learning elements were also used in part to fine-tune the parameters. According to the analysis of the dictionary example sentences and the coursebook sentences, a good Estonian example sentence should be a full sentence that meets, inter alia, the following parameters: it is 4–20 tokens long; it contains no tokens longer than 20 characters; it does not begin with certain parts of speech (e.g., a conjunction), an anaphoric word (e.g., sellepärast ‘this is why’) or an anaphoric word pair (e.g., sellisel puhul ‘in such a case’); and it contains no vulgar or disparaging words, rare words, etc. The study resulted in the compilation of the Estonian Corpus for Learners 2018 (etSkELL), which contains only sentences that meet the developed parameters. The corpus, in turn, serves as the basis for the corpus-based web tool Sketch Engine for Estonian Language Learning (etSkELL) and for the web sentences in Sõnaveeb, the language portal of the Institute of the Estonian Language.
    https://www.ester.ee/record=b530293
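
    The hard constraints listed in this abstract can be sketched as a simple rule-based filter. This is only an illustration under stated assumptions, not the GDEX configuration itself: the tokenizer is a plain regular expression that counts word tokens only, the word lists are hypothetical placeholders, and the part-of-speech, vulgarity and frequency checks are left out because they need resources the abstract does not describe.

```python
import re

# Hypothetical word lists for illustration only; the dissertation's real lists are not given here.
BANNED_FIRST_WORDS = {"ja", "ning", "aga", "sellepärast"}  # conjunctions and anaphoric words
BANNED_FIRST_PAIRS = {("sellisel", "puhul")}               # anaphoric word pairs


def tokenize(sentence):
    """Split a sentence into lowercase word tokens (an assumption, not the Sketch Engine tokenizer)."""
    return re.findall(r"\w+", sentence.lower())


def is_candidate(sentence):
    """Return True if the sentence passes the length and sentence-start rules from the abstract."""
    tokens = tokenize(sentence)
    if not 4 <= len(tokens) <= 20:        # sentence length: 4-20 tokens
        return False
    if any(len(t) > 20 for t in tokens):  # no token longer than 20 characters
        return False
    if tokens[0] in BANNED_FIRST_WORDS:   # must not start with a banned word
        return False
    if tuple(tokens[:2]) in BANNED_FIRST_PAIRS:  # must not start with a banned word pair
        return False
    return True
```

    For instance, `is_candidate("Ta läks eile õhtul koju.")` passes these checks, while a sentence starting with "Sellepärast" or "Sellisel puhul" is rejected.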

    Analysis of CEFR-graded coursebook sentences and their use for automatic detection of good dictionary examples (Eesti keele kui teise keele õpikute lausete analüüs ja selle rakendamine eri keeleoskustasemete sõnastike näitelausete automaatsel valikul)

    The aim of the article is to develop versions of the Estonian module of GDEX (Good Dictionary Example), the good-example tool of the corpus query system Sketch Engine, that help detect example sentence candidates of differing lexical, syntactic and grammatical complexity for different language proficiency levels. To this end, I analyse sentences from coursebooks of Estonian as a second language and determine which parameters distinguish the proficiency levels. The new versions are based on version 1.4 of the Estonian GDEX module, which was built on an analysis of dictionary example sentences and whose parameters I adjust according to the results of the coursebook sentence analysis. Applying the results makes it possible to create learner corpora for different proficiency levels, suitable for use in dictionary portals (e.g. Sõnaveeb), in language learning applications (e.g. etSkELL) and in the creation of other learning materials. *** The aim of the study was to develop new Estonian GDEX configurations for the A, B and C language proficiency levels. GDEX (Good Dictionary Example) (Kilgarriff et al. 2008) is a software module of the corpus query system Sketch Engine (Kilgarriff et al. 2004) that helps to identify good dictionary example candidates in large corpora. In order to identify which parameters characterise sentences at each proficiency level, full sentences from the Estonian Coursebook Corpus 2018 were analysed using a program called Analyser of Sentence Parameters, developed at the Institute of the Estonian Language. The analyser reports how long the sentences and tokens are, which verb forms are used, what syntactic properties the sentences have, etc. The analysis showed that, compared to the latest Estonian GDEX configuration (1.4), parameters such as sentence and token length and the occurrence of certain verb forms and parts of speech needed to be adjusted. Accordingly, the sentence length was set to 3–14 tokens for A-level (optimal interval 4–7 tokens), 3–18 tokens for B-level (optimal interval 4–12 tokens) and 4–23 tokens for C-level (optimal interval 6–14 tokens). A new classifier was introduced that penalises tokens longer than 9 characters at A-level and tokens longer than 11 characters at B-level. At A- and B-levels, certain verb forms were penalised or banned from appearing in the sentence. etSkELL, a corpus tool for Estonian language learning, and the dictionary portal Sõnaveeb (Wordweb) are introduced as possible ways to put the output of the new GDEX configurations to use. The results of this paper can be applied in compiling corpora and teaching materials for different language proficiency levels.
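
    The level-specific settings reported in this abstract can be written down as a small configuration table with a toy scoring function. This is a hedged sketch, not the actual GDEX configuration format: the length intervals and token-length thresholds come from the abstract, while the class name, the function name and the penalty sizes (0.25 and 0.125) are assumptions chosen only for illustration. No token-length penalty is listed for C-level because the abstract reports none.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LevelConfig:
    min_len: int                    # minimum sentence length in tokens
    max_len: int                    # maximum sentence length in tokens
    opt_min: int                    # lower bound of the optimal interval
    opt_max: int                    # upper bound of the optimal interval
    max_token_chars: Optional[int]  # tokens longer than this are penalised; None = not reported


# Values taken from the abstract.
CONFIGS = {
    "A": LevelConfig(3, 14, 4, 7, 9),
    "B": LevelConfig(3, 18, 4, 12, 11),
    "C": LevelConfig(4, 23, 6, 14, None),
}


def score(tokens, level):
    """Score a tokenized sentence between 0 and 1; penalty sizes are hypothetical."""
    cfg = CONFIGS[level]
    n = len(tokens)
    if not cfg.min_len <= n <= cfg.max_len:
        return 0.0                      # outside the allowed length range
    s = 1.0
    if not cfg.opt_min <= n <= cfg.opt_max:
        s -= 0.25                       # hypothetical penalty outside the optimal interval
    if cfg.max_token_chars is not None:
        long_tokens = sum(len(t) > cfg.max_token_chars for t in tokens)
        s -= 0.125 * long_tokens        # hypothetical penalty per overlong token
    return max(s, 0.0)
```

    Under these assumptions, the same sentence can score differently per level: a six-token sentence containing one very long word is penalised at A-level but not at C-level.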

    A comparison of collocations and word associations in Estonian from the perspective of parts of speech

    The paper provides a comparative study of the collocational and associative structures in Estonian with respect to the role of parts of speech. The lists of collocations and associations for an equal set of nouns, verbs and adjectives, taken from the respective dictionaries, are analysed to find both the range of coincidences and the differences. The results show a moderate overlap, with the biggest overlap occurring among the adjectival associates and collocates. Overall, nouns prevail among the associated and collocated items. The coinciding sets of relations are tentatively explained by the influence of grammatical relations, i.e. the patterns of local grammar that bind the collocations together and motivate the associations. The results are discussed with respect to the possible reasons for the mismatch between associations and collocations and in relation to the application of these findings in lexicography and second language acquisition.

    State-of-the-art on monolingual lexicography for Estonia

    The paper describes the state of the art of monolingual lexicography in Estonia. Firstly, we describe the current situation in Estonia and the main public functions performed by the Institute of the Estonian Language. Secondly, we provide an overview of the primary types of monolingual academic dictionaries (dictionaries of Standard Estonian and explanatory dictionaries) published in Estonia since the 20th century. Monolingual learner’s lexicography emerged as a new field in the 2010s, focusing on basic vocabulary and collocations. Thirdly, we give a short overview of accessibility policy and the availability of language resources for Estonian. Finally, we outline future work in lexicography at the Institute. Within the framework of the new dictionary writing system Ekilex, the Institute is moving away from presenting separate interfaces for different dictionaries towards a unified data model in order to provide the data in aggregated form.

    D3.8 Lexical-semantic analytics for NLP

    UIDB/03213/2020 UIDP/03213/2020. The present document illustrates the work carried out in task 3.3 (work package 3) of the ELEXIS project, focused on lexical-semantic analytics for Natural Language Processing (NLP). This task aims at computing analytics for lexical-semantic information such as words, senses and domains in the available resources, investigating their role in NLP applications. Specifically, this task concentrates on three research directions, namely i) sense clustering, in which grouping senses based on their semantic similarity improves the performance of NLP tasks such as Word Sense Disambiguation (WSD), ii) domain labeling of text, in which the lexicographic resources made available by the ELEXIS project for research purposes allow better performance to be achieved, and finally iii) analysing the diachronic distribution of senses, for which a software package is made available.

    National identity predicts public health support during a global pandemic

    Changing collective behaviour and supporting non-pharmaceutical interventions is an important component in mitigating virus transmission during a pandemic. In a large international collaboration (Study 1, N = 49,968 across 67 countries), we investigated self-reported factors associated with public health behaviours (e.g., spatial distancing and stricter hygiene) and endorsed public policy interventions (e.g., closing bars and restaurants) during the early stage of the COVID-19 pandemic (April-May 2020). Respondents who reported identifying more strongly with their nation consistently reported greater engagement in public health behaviours and support for public health policies. Results were similar for representative and non-representative national samples. Study 2 (N = 42 countries) conceptually replicated the central finding using aggregate indices of national identity (obtained using the World Values Survey) and a measure of actual behaviour change during the pandemic (obtained from Google mobility reports). Higher levels of national identification prior to the pandemic predicted lower mobility during the early stage of the pandemic (r = −0.40). We discuss the potential implications of links between national identity, leadership, and public health for managing COVID-19 and future pandemics.

    Predicting attitudinal and behavioral responses to COVID-19 pandemic using machine learning

    At the beginning of 2020, COVID-19 became a global problem. Despite all the efforts to emphasize the relevance of preventive measures, not everyone adhered to them. Thus, learning more about the characteristics determining attitudinal and behavioral responses to the pandemic is crucial to improving future interventions. In this study, we applied machine learning to the multinational data collected by the International Collaboration on the Social and Moral Psychology of COVID-19 (N = 51,404) to test the predictive efficacy of constructs from social, moral, cognitive, and personality psychology, as well as socio-demographic factors, in the attitudinal and behavioral responses to the pandemic. The results point to several valuable insights. Internalized moral identity provided the most consistent predictive contribution: individuals perceiving moral traits as central to their self-concept reported higher adherence to preventive measures. Similar results were found for morality as cooperation, symbolized moral identity, self-control, open-mindedness, and collective narcissism, while the inverse relationship was evident for the endorsement of conspiracy theories. However, we also found non-negligible variability in the explained variance and predictive contributions with respect to macro-level factors such as the pandemic stage or cultural region. Overall, the results underscore the importance of morality-related and contextual factors in understanding adherence to public health recommendations during the pandemic. Peer reviewed.

    National identity predicts public health support during a global pandemic (Nature Communications (2022), vol. 13, no. 1, 517, 10.1038/s41467-021-27668-9)

    Publisher Copyright: © The Author(s) 2022. In this article the author name ‘Agustin Ibanez’ was incorrectly written as ‘Augustin Ibanez’. The original article has been corrected. Peer reviewed.